{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using wrappers for Scikit learn API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial is about using gensim models as a part of your scikit learn workflow with the help of wrappers found at ```gensim.sklearn_integration```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The wrapper available (as of now) are :\n",
    "* LdaModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_ldaModel.SklearnWrapperLdaModel```),which implements gensim's ```LdaModel``` in a scikit-learn interface\n",
    "\n",
    "* LsiModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_lsiModel.SklearnWrapperLsiModel```),which implements gensim's ```LsiModel``` in a scikit-learn interface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LdaModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use LdaModel begin with importing LdaModel wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklearnWrapperLdaModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will create a dummy set of texts and convert it into a corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.corpora import Dictionary\n",
    "texts = [['complier', 'system', 'computer'],\n",
    " ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],\n",
    " ['graph', 'flow', 'network', 'graph'],\n",
    " ['loading', 'computer', 'system'],\n",
    " ['user', 'server', 'system'],\n",
    " ['tree','hamiltonian'],\n",
    " ['graph', 'trees'],\n",
    " ['computer', 'kernel', 'malfunction','computer'],\n",
    " ['server','system','computer']]\n",
    "dictionary = Dictionary(texts)\n",
    "corpus = [dictionary.doc2bow(text) for text in texts]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then to run the LdaModel on it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([[ 0.85275314,  0.14724686],\n",
       "       [ 0.12390183,  0.87609817],\n",
       "       [ 0.4612995 ,  0.5387005 ],\n",
       "       [ 0.84924177,  0.15075823],\n",
       "       [ 0.49180096,  0.50819904],\n",
       "       [ 0.40086923,  0.59913077],\n",
       "       [ 0.28454427,  0.71545573],\n",
       "       [ 0.88776198,  0.11223802],\n",
       "       [ 0.84210373,  0.15789627]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model=SklearnWrapperLdaModel(num_topics=2,id2word=dictionary,iterations=20, random_state=1)\n",
    "model.fit(corpus)\n",
    "model.print_topics(2)\n",
    "model.transform(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Integration with Sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To provide a better example of how it can be used with Sklearn, Let's use CountVectorizer method of sklearn. For this example we will use [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We will only use the categories rec.sport.baseball and sci.crypt and use it to generate topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from gensim import matutils\n",
    "from gensim.models.ldamodel import LdaModel\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklearnWrapperLdaModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rand = np.random.mtrand.RandomState(1) # set seed for getting same result\n",
    "cats = ['rec.sport.baseball', 'sci.crypt']\n",
    "data = fetch_20newsgroups(subset='train',\n",
    "                        categories=cats,\n",
    "                        shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use countvectorizer to convert the collection of text documents to a matrix of token counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vec = CountVectorizer(min_df=10, stop_words='english')\n",
    "\n",
    "X = vec.fit_transform(data.data)\n",
    "vocab = vec.get_feature_names() #vocab to be converted to id2word \n",
    "\n",
    "id2word=dict([(i, s) for i, s in enumerate(vocab)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we just need to fit X and id2word to our Lda wrapper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  u'0.025*\"456\" + 0.021*\"argue\" + 0.016*\"bitnet\" + 0.015*\"beastmaster\" + 0.014*\"cryptography\" + 0.013*\"false\" + 0.012*\"digex\" + 0.011*\"cover\" + 0.011*\"classified\" + 0.010*\"disk\"'),\n",
       " (1,\n",
       "  u'0.142*\"abroad\" + 0.113*\"asking\" + 0.088*\"cryptography\" + 0.044*\"ciphertext\" + 0.043*\"arithmetic\" + 0.032*\"courtesy\" + 0.030*\"facts\" + 0.021*\"argue\" + 0.019*\"amolitor\" + 0.018*\"agree\"'),\n",
       " (2,\n",
       "  u'0.034*\"certain\" + 0.027*\"69\" + 0.025*\"book\" + 0.025*\"demand\" + 0.024*\"87\" + 0.024*\"cracking\" + 0.021*\"farm\" + 0.019*\"fierkelab\" + 0.015*\"face\" + 0.011*\"abroad\"'),\n",
       " (3,\n",
       "  u'0.017*\"decipher\" + 0.017*\"example\" + 0.016*\"cases\" + 0.016*\"follow\" + 0.008*\"considering\" + 0.006*\"forgot\" + 0.006*\"cellular\" + 0.005*\"evans\" + 0.005*\"computed\" + 0.005*\"cia\"'),\n",
       " (4,\n",
       "  u'0.022*\"accurate\" + 0.021*\"corporate\" + 0.013*\"chance\" + 0.012*\"clark\" + 0.009*\"consideration\" + 0.009*\"candidates\" + 0.008*\"dawson\" + 0.008*\"authentication\" + 0.008*\"assess\" + 0.008*\"attempt\"')]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj=SklearnWrapperLdaModel(id2word=id2word,num_topics=5,passes=20)\n",
    "lda=obj.fit(X)\n",
    "lda.print_topics()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Example for Using Grid Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection  import GridSearchCV\n",
    "from gensim.models.coherencemodel import CoherenceModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def scorer(estimator, X,y=None):\n",
    "    goodcm = CoherenceModel(model=estimator, texts= texts, dictionary=estimator.id2word, coherence='c_v')\n",
    "    return goodcm.get_coherence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise',\n",
       "       estimator=SklearnWrapperLdaModel(alpha='symmetric', chunksize=2000, corpus=None,\n",
       "            decay=0.5, eta=None, eval_every=10, gamma_threshold=0.001,\n",
       "            id2word=<gensim.corpora.dictionary.Dictionary object at 0x7f42ccbebd10>,\n",
       "            iterations=50, minimum_probability=0.01, num_topics=5,\n",
       "            offset=1.0, passes=20, random_state=None, update_every=1),\n",
       "       fit_params={}, iid=True, n_jobs=1,\n",
       "       param_grid={'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring=<function scorer at 0x7f42cad12230>, verbose=0)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj=SklearnWrapperLdaModel(id2word=dictionary,num_topics=5,passes=20)\n",
    "parameters = {'num_topics':(2, 3, 5, 10), 'iterations':(1,20,50)}\n",
    "model = GridSearchCV(obj, parameters, scoring=scorer, cv=5)\n",
    "model.fit(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'iterations': 20, 'num_topics': 3}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn import linear_model\n",
    "def print_features_pipe(clf, vocab, n=10):\n",
    "    ''' Better printing for sorted list '''\n",
    "    coef = clf.named_steps['classifier'].coef_[0]\n",
    "    print coef\n",
    "    print 'Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0]))\n",
    "    print 'Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0]))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "id2word=Dictionary(map(lambda x : x.split(),data.data))\n",
    "corpus = [id2word.doc2bow(i.split()) for i in data.data]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ -2.95020466e-01  -1.04115352e-01   5.19570267e-01   1.03817059e-01\n",
      "   2.72881013e-02   1.35738501e-02   1.89246630e-13   1.89246630e-13\n",
      "   1.89246630e-13   1.89246630e-13   1.89246630e-13   1.89246630e-13\n",
      "   1.89246630e-13   1.89246630e-13   1.89246630e-13]\n",
      "Positive features: Fame,:0.52 Keach:0.10 comp.org.eff.talk,:0.03 comp.org.eff.talk.:0.01 >Pat:0.00 dome.:0.00 internet...:0.00 trawling:0.00 hanging:0.00 red@redpoll.neoucom.edu:0.00\n",
      "Negative features: Fame.:-0.30 considered,:-0.10\n",
      "0.531040268456\n"
     ]
    }
   ],
   "source": [
    "model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n",
    "clf=linear_model.LogisticRegression(penalty='l2', C=0.1) #l2 penalty used\n",
    "pipe = Pipeline((('features', model,), ('classifier', clf)))\n",
    "pipe.fit(corpus, data.target)\n",
    "print_features_pipe(pipe, id2word.values())\n",
    "print pipe.score(corpus, data.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LsiModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use LsiModel begin with importing LsiModel wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel import SklearnWrapperLsiModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example of Using Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.13652819  0.00383696  0.02635504 -0.08454895 -0.02356143  0.60020084\n",
      "  1.07026252 -0.04072257  0.43732847  0.54913549 -0.20242834 -0.21855402\n",
      " -1.30546283 -0.08690711  0.17606255]\n",
      "Positive features: 01101001B:1.07 comp.org.eff.talk.:0.60 red@redpoll.neoucom.edu:0.55 circuitry:0.44 >Pat:0.18 Fame.:0.14 Fame,:0.03 considered,:0.00\n",
      "Negative features: internet...:-1.31 trawling:-0.22 hanging:-0.20 dome.:-0.09 Keach:-0.08 *best*:-0.04 comp.org.eff.talk,:-0.02\n",
      "0.865771812081\n"
     ]
    }
   ],
   "source": [
    "model=SklearnWrapperLsiModel(num_topics=15, id2word=id2word)\n",
    "clf=linear_model.LogisticRegression(penalty='l2', C=0.1) #l2 penalty used\n",
    "pipe = Pipeline((('features', model,), ('classifier', clf)))\n",
    "pipe.fit(corpus, data.target)\n",
    "print_features_pipe(pipe, id2word.values())\n",
    "print pipe.score(corpus, data.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
